server: preserve primary KV cache when MTP companion trim fails by localweights · Pull Request #1889 · ikawrakow/ik_llama.cpp

localweights · 2026-05-26T21:27:39Z

Summary

The pre-batch reset path in server_context partially trims both the target ctx and (when MTP is enabled) its speculative companion ctx at p0 = system + n_past. The existing logic treats a failure of either trim as a reason to nuke cache_tokens, slot.n_past, n_prompt_tokens_cache, and the checkpoint list, and to reset the sampler.

Companion failures happen routinely after generation: unvalidated draft tokens leave the companion KV's position layout out of sync with the primary's. Sacrificing the primary cache for a recoverable mismatch confined to the draft ctx forces a full re-prefill on the next request, which defeats the point of prefix caching when MTP is on.

Change

Split the fallback into two paths:

target_trimmed && !companion_trimmed → wipe only the companion (it repopulates during the next prefill); leave the primary cache + checkpoints + sampler state intact.
!target_trimmed → unchanged conservative full reset (the original non-Transformer fall-through case that the existing comment alludes to).
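
The two paths above can be sketched as a small decision function. This is only an illustration of the control flow described in this PR, not the actual patch: `SlotState`, `TrimFallback`, `choose_fallback`, `apply_fallback`, and the `companion_cleared` flag are hypothetical names invented for the sketch, and the real server code operates on llama contexts rather than plain flags. The sketch assumes that when MTP is disabled the caller passes `companion_trimmed = true`, since there is no companion to trim.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-slot state named in the PR description.
struct SlotState {
    std::vector<int> cache_tokens;          // tokens backed by the primary KV cache
    int              n_past = 0;
    int              n_prompt_tokens_cache = 0;
    std::vector<int> checkpoints;           // placeholder for the checkpoint list
    bool             sampler_reset = false; // set when the sampler would be reset
    bool             companion_cleared = false; // set when the companion KV would be wiped
};

enum class TrimFallback { None, CompanionOnly, FullReset };

// Pick the fallback from the result of each partial trim.
// Assumption: with MTP disabled the caller passes companion_trimmed = true.
TrimFallback choose_fallback(bool target_trimmed, bool companion_trimmed) {
    if (!target_trimmed)    return TrimFallback::FullReset;     // conservative path, unchanged
    if (!companion_trimmed) return TrimFallback::CompanionOnly; // new path from this PR
    return TrimFallback::None;
}

void apply_fallback(SlotState & slot, TrimFallback fb) {
    assert(slot.n_past >= 0); // sanity check on the sketch's own state
    switch (fb) {
        case TrimFallback::None:
            break;
        case TrimFallback::CompanionOnly:
            // Wipe only the companion; primary cache, checkpoints and sampler stay intact.
            slot.companion_cleared = true;
            break;
        case TrimFallback::FullReset:
            // Clear the fields listed in the PR summary.
            slot.cache_tokens.clear();
            slot.n_past = 0;
            slot.n_prompt_tokens_cache = 0;
            slot.checkpoints.clear();
            slot.sampler_reset = true;
            break;
    }
}
```

Keeping the decision separate from the side effects makes the asymmetry explicit: only a failed target trim can invalidate the primary cache.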

Validation

Tested on Qwen3.6-27B + --multi-token-prediction --draft-max 3 + --reasoning on. Combined with #1888 (qwen3next checkpoint reuse), multi-pass synthesis goes from 0% prefix-cache reuse to 92% reuse on shared-prefix follow-up calls. Without this patch the companion-trim failure path still wiped the primary cache and undid the checkpoint fix.
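
For readers who want to reason about the reuse figure: the PR does not say how the percentage was measured. One plausible definition, assumed here purely for illustration, is the length of the common token prefix between the cached tokens and the new prompt, divided by the prompt length. The function names below are hypothetical and do not come from the server code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of leading tokens shared by the cached sequence and the new prompt.
std::size_t common_prefix_len(const std::vector<int> & cached,
                              const std::vector<int> & prompt) {
    const std::size_t n = std::min(cached.size(), prompt.size());
    std::size_t i = 0;
    while (i < n && cached[i] == prompt[i]) {
        ++i;
    }
    return i;
}

// Fraction of the prompt that could be served from the cache (assumed metric).
double prefix_reuse_fraction(const std::vector<int> & cached,
                             const std::vector<int> & prompt) {
    if (prompt.empty()) {
        return 0.0;
    }
    return static_cast<double>(common_prefix_len(cached, prompt)) /
           static_cast<double>(prompt.size());
}
```

Under this definition, a full reset empties the cached sequence, so the fraction drops to zero no matter how much of the prompt is shared with the previous request.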

Note

This patch addresses the pre-batch reset site that exists in current main. PR #1877 (Fix prompt cache viability) introduces a similar trim-fallback at the second post-prefix-match site; the same split should be applied there when that PR lands.

The pre-batch reset path in server-context partially trims both the target ctx and (if MTP is enabled) its speculative companion ctx at p0 = system + n_past. Either failure currently triggers a full reset that nukes cache_tokens, slot.n_past, n_prompt_tokens_cache, the checkpoint list, and the sampler state.

Companion failures are common after generation because unvalidated draft tokens leave the companion KV's position layout out of sync with the primary's. Sacrificing the primary cache for that recoverable mismatch forces a full re-prefill on the next request, even though the primary KV trim succeeded.

This change splits the fallback: when only the companion fails, wipe just the companion (it repopulates during the next prefill) and keep the primary cache + checkpoints intact. The full-reset path remains in place for when the primary itself fails to trim (non-Transformer fall-through case the comment alludes to).

Validated on Qwen3.6-27B + --multi-token-prediction --draft-max 3: 92% prefix-cache reuse on multi-pass synthesis vs 0% before this change.

ikawrakow · 2026-05-27T04:43:54Z

Can you provide a reproduction where trimming one context succeeds but trimming the other fails?

ikawrakow · 2026-05-28T06:10:28Z

Add an issue with reproduction. After that you can resubmit the PR.

ikawrakow closed this May 28, 2026

localweights wants to merge 1 commit into ikawrakow:main from localweights:fix-mtp-companion-preserve-cache